1 Introduction

PathAnalyser is an intuitive and user-friendly package that allows users to assess pathway activity of transcriptomic data sets using a pathway gene signature, a list of genes whose expression varies significantly when the corresponding is active. PathAnalyser uses GSVA to estimate expression abundance of the up-regulated and down-regulated gene sets of the gene signature for each sample in a transcriptomic data set, and then ranks samples based on expression abundance of the up-regulated and down-regulated gene sets of the gene signature separately. The classification algorithm uses user-configurable thresholds to tailor the pathway-based classification to their own classification needs. Unlike other available pathway assessment packages which do not distinguish between up-regulated and down-regulated gene sets, the PathAnalyser classification algorithm considers both parts of a gene signature: the up-regulated and down-regulated genes separately in assessing consistency of expression with the gene signature.

Other features unique to PathAnalyser package include:

  1. Multiple samples are considered for the classification analysis, building upon the sample-wise analysis provided by the GSVA package

  2. Unlike other classification algorithms, PathAnalyser does not directly force classification of samples into active or inactive pathway classes

  3. Detailed evaluation of the pathway-based sample classification including descriptive statistics such as accuracy, percentage classified, recall and ROC curve visualization

  1. Classification visualization - PCA plots can be generated for visualization of the predicted pathway classes (active/inactive) for each sample.

PathAnalyser provides functionality and built-in data (including expression data set as well as gene signatures) for assessing ER and HER2 pathway activity in breast cancer transcriptomic data sets. These transcriptional signatures have been shown to have clinical predictive value, in particular, ER and HER2 gene signatures have been associated with molecular sub-types, prognosis and treatment response in breast cancer. Although this package was original developed to assess ER and HER2 pathway activity in breast cancer transcriptomic data sets, the package functionality could be applied to transcriptomic data sets and gene signatures outside the context of breast cancer.

In this vignette, we will describe how to use the PathAnalyser package with microarray and RNA-seq expression transcriptomic data sets and gene expression signatures associated with a specific pathway activity.

2 General Workflow

A flow chart of the general workflow of the analysis conducted using PathAnalyser is illustrated below:

PathAnalyser workflow

Figure 1: PathAnalyser workflow

Note: The classification evaluation step is optional and can only be performed if actual pathway activity class labels (e.g. active, inactive or uncertain) for the corresponding pathway are present.

2.1 Overview of PathAnalyser functionality

The PathAnalyser package functionality can be sub-divided into 5 broad categories corresponding to a typical workflow for assessing pathway activity of samples:

  1. Input functions for expression data and gene signatures
  2. Quality control and pre-processing of input data
  3. Classification of samples by pathway activity
  4. Evaluation of classification
  5. Visualization

3 Installation and setup

3.1 Installing dependencies

To install all required dependencies of PathAnalyser, type the following in R:

# If not already installed
install.packages("BiocManager")

BiocManager::install(c("GSVA", "pROC", "edgeR", "reshape2", "ggplot2","limma", 
                       "VennDiagram", "NCmisc", "futile.logger"), 
                     dependencies = TRUE)

3.2 Installing package using devtools

The easiest way to install PathAnalyser is to install directly from GitHub using the devtools package in R:

install.packages("devtools")
devtools::install_github("ozlemkaradeniz/PathAnalyser")

Alternatively, you can clone the GitHub repository in a Linux / MacOS terminal:

git clone git@github.com:a-thind/PathAnalyser.git

Then type the following in R to install a local source version of the package:

library(utils)
install.packages("~/PathAnalyser", repo=NULL, type="source")

Once installation is completed, load the library to start using PathAnalyser:

library(PathAnalyser)

4 Data formats

The classification algorithm of PathAnalyser requires two types of input data:

  1. Gene expression matrix - a gene-by-sample gene expression matrix, with row names corresponding to gene names and column names corresponding to sample names / IDs.
  2. gene signature data frame - a data frame of expression values corresponding to a list of genes associated with a pathway. The first column contains gene names and the second column contains expression values: -1 for down-regulated or 1 for up-regulated genes when a specific pathway is active.

4.1 Input files

The two data types described above (gene expression matrix and gene signature dataframe) can be created from two types of input files:

  1. Gene expression matrix file - a tab-delimited text file containing a gene expression matrix, where rows represent genes and columns samples of a gene expression data set. The gene expression matrix text file can contain expression values from RNASeq or normalized microarray data. The image below is a snapshot of an example expression data set file:
knitr::include_graphics("expr_dataset_example.png")
An example gene expression matrix text file

Figure 2: An example gene expression matrix text file

The first line contains the sample names and the first column contains gene names. All other tab-separated fields represent expression values for each gene for each sample from either RNASeq or microarray experiments.

Note: PathAnalyser does not provide functionality for unnormalized microarray data, so the user must ensure the microarray data is normalized prior to performing classification by PathAnalyser.

  1. Gene signature files - this data is provided as two text files containing a list of genes provided in the gene set file format (using .grp file extension) - one for the up-regulated gene set and another containing the down-regulated gene set of the pathway signature. A plethora of pathway gene signatures in the gene set file format can be found at MSigDB. The snapshot below shows the first 30 lines of the gene set file for the up-regulated gene set of the ERBB2 (HER2) gene signature.
knitr::include_graphics("up_gene_signature.png")
An example gene set file format containing either the up- or down-regulated gene sets of the gene signature

Figure 3: An example gene set file format containing either the up- or down-regulated gene sets of the gene signature

Note: read_signature_data from PathAnalyser only accepts gene signatures that have the up-regulated and down-regulated gene sets in separate gene set format text files (with file extension .grp). The user is expected to provide one up-regulated and one down-regulated gene set file for a pathway gene signature.

5 Built-in datasets and gene signatures

5.1 Gene expression signatures

PathAnalyser contains two gene signatures for active ER and HER2 pathways that can be used to assess ER and HER2 pathway activity in a breast cancer transcriptomic data set. These signatures are accessible by typing the following in R:

data("ER_sig_df")
data("HER2_sig_df")

5.1.1 ER signature

The built-in ER signature ER_sig_df is a data frame containing the names of 160 differentially expressed genes which constitute the transcriptional signature of ER pathway activation. This signature was obtained from the sensitivity to endocrine therapy genome index defined by Symanns et al. (2011). The total number of genes that are up-regulated (denoted with an expression value of 1) and down-regulated (denoted with an expression value of -1) in this gene signature are 101 and 59 genes respectively.

dim(ER_sig_df)
#> [1] 160   2
head(ER_sig_df)
#>     gene expression
#> 1   ABAT          1
#> 2 ACADSB          1
#> 3  ADCY1          1
#> 4  ADCY9          1
#> 5   AGR2          1
#> 6  ANXA9          1
# up-regulated genes are given an expression value of 1
ER_up_sig <- ER_sig_df[ER_sig_df$expression == 1 ,]
dim(ER_up_sig)
#> [1] 101   2
# down-regulated genes are given an expression value of -1
ER_dn_sig <- ER_sig_df[ER_sig_df$expression == -1 ,]
dim(ER_dn_sig)
#> [1] 59  2
# for more details
?ER_sig_df

5.1.2 HER2 signature

A gene expression signature for HER2 (ERBB2) pathway activation HER2_sig_df is also provided by PathAnalyser. This gene signature was obtained from Molecular Signatures Database (MSigDB) by combining the up-regulated (gene set: SMID_BREAST_CANCER_ERBB2_UP ) and down-regulated (gene set: SMID_BREAST_CANCER_ERBB2_DN ) gene sets available at MSigDB, originating from transcriptomic profiling of a 344 primary breast tumor samples in a study by Smid et al. 2010 in a study examining the subtypes of breast cancer and organ-specifc relapse.

The names of 156 differentially expressed genes are in the HER2 signature data frame provided by PathAnalyser, of which 190 are up-regulated (denoted by an expression of 1) and 197 are down-regulated (denoted by an expression of -1).

dim(HER2_sig_df)
#> [1] 156   2
head(HER2_sig_df)
#>    genes expression
#> 1 ABCA12          1
#> 2  ABCC2          1
#> 3  ABCC3          1
#> 4  ABCC6          1
#> 5   ACE2          1
#> 6  ACSL1          1
# up-regulated genes are given an expression value of 1
HER2_up_sig <- HER2_sig_df[HER2_sig_df$expression == 1 ,]
dim(HER2_up_sig)
#> [1] 151   2
# Down-regulated genes are given an expression value of -1
HER2_dn_sig <- HER2_sig_df[HER2_sig_df$expression == -1 ,]
dim(HER2_dn_sig)
#> [1] 5 2
# for more details
?HER2_sig_df

5.2 Gene expression datasets

PathAnalyser also contains two built-in gene expression matrices each containing RNA-seq raw read counts for primary breast cancer samples obtained from 60 individuals (cases). Data for these matrices were obtained from The Cancer Genome Atlas (TCGA). Each expression matrix contains 20,124 genes.

data("ER_data_se1")
data("HER2_data_se1")
# column names represent case (sample) IDs from TCGA

5.2.1 ER dataset 1

The ER data set ER_data_se1 contains RNASeq raw read counts for 60 primary breast cancer samples, 30 of which have ER pathway activity (ER+) and 30 which have inactive ER pathway activity (ER-).

dim(ER_data_se1)
#> [1] 20124    60
# Expression data for first 6 genes
head(ER_data_se1)
#>         TCGA.A8.A08X TCGA.AR.A0TY TCGA.AN.A0FV TCGA.AO.A124 TCGA.A2.A0D3
#> OR5H15             0            0            0            0            0
#> SPANXN2            0            0            0            0            0
#> TAF7            2512        16841         6985         6974         6431
#> TMEM107          384          189          703          851         1054
#> CAMK2A            44           10           72           20            7
#> TRAM1L1           71          233          622          198          483
#>         TCGA.AN.A0XT TCGA.BH.A0WA TCGA.AO.A0J8 TCGA.LD.A9QF TCGA.C8.A26Y
#> OR5H15             0            0            0            0            0
#> SPANXN2            0            0            0            0            0
#> TAF7            9864         5267        12210         4838         8627
#> TMEM107          478          870         1298          364         1599
#> CAMK2A            12           37           19           89           35
#> TRAM1L1          218          305         1044          291          295
#>         TCGA.E2.A1B5 TCGA.C8.A27B TCGA.A7.A5ZX TCGA.AR.A5QQ TCGA.A8.A096
#> OR5H15             0            0            0            0            0
#> SPANXN2            0            0            0            0            0
#> TAF7           11152         8510         5995         5264        13536
#> TMEM107          456          506          615          429         1353
#> CAMK2A             8            9           30           10           74
#> TRAM1L1          349          119          346          154          928
#>         TCGA.EW.A1P4 TCGA.AN.A0XU TCGA.A7.A6VW TCGA.BH.A1F8 TCGA.D8.A1XZ
#> OR5H15             0            0            0            0            0
#> SPANXN2            0            0            0            0            0
#> TAF7            6181         6455         3796        17640         8627
#> TMEM107          309          796          452          433          629
#> CAMK2A            13           16           94           91           30
#> TRAM1L1          133          362          119          651          385
#>         TCGA.D8.A1X8 TCGA.E2.A1LB TCGA.LL.A8F5 TCGA.E9.A1N8 TCGA.BH.A202
#> OR5H15             0            0            0            0            0
#> SPANXN2            0            0            0            0            0
#> TAF7            4124        11084         7663         4195         6741
#> TMEM107          653          307         1265          312          733
#> CAMK2A            23           16           45           11           12
#> TRAM1L1          871          285          342          248          451
#>         TCGA.AC.A3TN TCGA.AC.A3QQ TCGA.E2.A108 TCGA.E2.A14X TCGA.AR.A256
#> OR5H15             0            0            0            0            0
#> SPANXN2            0            0            0            0            0
#> TAF7            2869          718         4147         6440        10784
#> TMEM107          660          230          475          572          319
#> CAMK2A            34           25           15           14           32
#> TRAM1L1          482           28          157          249          507
#>         TCGA.A2.A1G4 TCGA.A2.A0YF TCGA.AC.A23C TCGA.E2.A15F TCGA.D8.A27I
#> OR5H15             0            0            0            0            0
#> SPANXN2            0            0            0            0            0
#> TAF7           10798        14649        11091         7106         9755
#> TMEM107          920         1738          418          319          838
#> CAMK2A             8            4           87           48           56
#> TRAM1L1          610          428          473          507          773
#>         TCGA.BH.A42U TCGA.AO.A1KQ TCGA.BH.A0E0 TCGA.D8.A1JS TCGA.AC.A6NO
#> OR5H15             0            0            0            0            0
#> SPANXN2            0            0            0            0            0
#> TAF7            9775         6351         3150         8096         5839
#> TMEM107          785          633          234         1472         1138
#> CAMK2A            34           32           34            4           53
#> TRAM1L1          306          169           71          448          328
#>         TCGA.A8.A0A7 TCGA.A1.A0SK TCGA.AN.A04D TCGA.5L.AAT0 TCGA.C8.A12N
#> OR5H15             0            0            0            0            0
#> SPANXN2            0            0            0            0            0
#> TAF7            4265         5306         7747         4839         6924
#> TMEM107          432          748          268          284          999
#> CAMK2A            21           12            4           43           44
#> TRAM1L1          169          326          787          272          285
#>         TCGA.AR.A24Z TCGA.AC.A2QH TCGA.D8.A13Z TCGA.E2.A1LL TCGA.B6.A402
#> OR5H15             0            0            0            0            0
#> SPANXN2            0            0            0            0            0
#> TAF7           10185         1001         6308         3424         4534
#> TMEM107          389          156          540          171          411
#> CAMK2A            19           33           77            4           35
#> TRAM1L1         1103           11          221           95          514
#>         TCGA.EW.A6SB TCGA.A1.A0SD TCGA.A2.A0T2 TCGA.D8.A1JP TCGA.A2.A0D2
#> OR5H15             0            0            0            0            0
#> SPANXN2            0            0            0            0            0
#> TAF7            3113         9669         4671         3934         4544
#> TMEM107          304          432          922          419          501
#> CAMK2A             2           41            6           29           39
#> TRAM1L1           54          578          158          144          279
#>         TCGA.BH.A0DK TCGA.BH.A1EW TCGA.A8.A091 TCGA.A7.A26G TCGA.HN.A2NL
#> OR5H15             0            0            0            0            0
#> SPANXN2            0            0            0            0            0
#> TAF7            8735         7793        11630         5765         5643
#> TMEM107          205          408         1353          427         1013
#> CAMK2A            15           12           25           88            5
#> TRAM1L1          240          292         1146          376          510

5.2.2 HER2 dataset 1

Similarly, the HER2 data set HER2_data_se1 contains RNASeq raw read counts for 60 primary breast cancer samples, 30 of which have HER2 (ERBB2) pathway activity (HER2+) and 30 which have inactive HER2 pathway activity (HER2-).

dim(HER2_data_se1)
#> [1] 20124    60
# Expression data for first 6 genes
head(HER2_data_se1)
#>         TCGA.A8.A090 TCGA.LD.A66U TCGA.AN.A0FJ TCGA.E9.A1ND TCGA.D8.A1XT
#> OR5H15             0            0            0            0            0
#> SPANXN2            0            0            0            0            0
#> TAF7            4404         8948         2729         5195         6094
#> TMEM107          274         1061          264          291          553
#> CAMK2A            42           17            3           12          115
#> TRAM1L1          698          549            8          208          397
#>         TCGA.BH.A0AU TCGA.AR.A1AY TCGA.AN.A04C TCGA.A7.A3J1 TCGA.BH.A0DP
#> OR5H15             0            0            0            0            0
#> SPANXN2            0            0            0            0            0
#> TAF7            3614         2844         6280         6257         4497
#> TMEM107          171          512          316         1090          495
#> CAMK2A            30           24            9           38           42
#> TRAM1L1          514          174          212          578          498
#>         TCGA.BH.A0HX TCGA.AN.A0XR TCGA.C8.A1HM TCGA.A2.A3XT TCGA.AN.A0FZ
#> OR5H15             0            0            0            0            0
#> SPANXN2            0            0            0            0            0
#> TAF7            7876         9083         6380         9376        15458
#> TMEM107          519          500          428          405          656
#> CAMK2A             8           18           13           60           41
#> TRAM1L1          538          735           84          271           66
#>         TCGA.E9.A22D TCGA.D8.A1XY TCGA.A8.A07E TCGA.AN.A0FS TCGA.E9.A1N9
#> OR5H15             0            0            0            0            0
#> SPANXN2            0            0            0            0            0
#> TAF7           14990        11480         9334         8879         3179
#> TMEM107         1058          387          496          894          150
#> CAMK2A             9           87           62           25           12
#> TRAM1L1          776          816          315         1217          128
#>         TCGA.AO.A0JE TCGA.BH.A0B6 TCGA.A8.A09G TCGA.C8.A12V TCGA.AR.A24W
#> OR5H15             0            0            0            0            0
#> SPANXN2            0            0            0            0            0
#> TAF7            9138         5640        10607         6867         9128
#> TMEM107          253          336          637          385          578
#> CAMK2A            47            5           11           18           29
#> TRAM1L1          467          449          309           33          382
#>         TCGA.AN.A0XV TCGA.C8.A12O TCGA.C8.A1HG TCGA.AN.A041 TCGA.A2.A25F
#> OR5H15             0            0            0            0            0
#> SPANXN2            0            0            0            0            1
#> TAF7            3110         5370         3267         6425         9156
#> TMEM107          742          532          326          595          592
#> CAMK2A            36          109            8           90           60
#> TRAM1L1          693          195          238          285          206
#>         TCGA.A8.A09B TCGA.A2.A0YF TCGA.BH.A18P TCGA.AC.A23C TCGA.LL.A9Q3
#> OR5H15             0            0            0            0            0
#> SPANXN2            0            0            0            0            0
#> TAF7            7051        14649         7683        11091         5873
#> TMEM107          848         1738          239          418          168
#> CAMK2A            42            4           32           87           13
#> TRAM1L1          478          428          510          473           77
#>         TCGA.UU.A93S TCGA.C8.A137 TCGA.EW.A3E8 TCGA.D8.A1XO TCGA.AN.A0FT
#> OR5H15             0            0            0            0            0
#> SPANXN2            1            0            0            0            0
#> TAF7            3770         6161        11883        10467         4009
#> TMEM107          468          142          608         1015          528
#> CAMK2A             4           33           32           34           43
#> TRAM1L1           83          197          733          544          180
#>         TCGA.BH.A0B9 TCGA.A8.A09X TCGA.A1.A0SK TCGA.A8.A097 TCGA.BH.A18Q
#> OR5H15             0            0            0            0            0
#> SPANXN2            0            0            0            0            0
#> TAF7           15277         5805         5306        14549         6441
#> TMEM107          582          172          748         1228          660
#> CAMK2A           301           35           12          117           19
#> TRAM1L1          653           61          326          767          266
#>         TCGA.GM.A2DO TCGA.A7.A4SE TCGA.E9.A22E TCGA.C8.A12N TCGA.E2.A14T
#> OR5H15             0            0            0            0            0
#> SPANXN2            0            0            0            0            0
#> TAF7            2700         3015        12731         6924        12687
#> TMEM107          256          491          554          999          836
#> CAMK2A             3           72           77           44           18
#> TRAM1L1           56           90          231          285          443
#>         TCGA.C8.A1HK TCGA.AC.A23H TCGA.BH.A18M TCGA.AO.A03T TCGA.A8.A0A4
#> OR5H15             0            0            0            0            2
#> SPANXN2            0            0            0            1            1
#> TAF7           10020         8397         4651         2198         8487
#> TMEM107          149          367          509          358          321
#> CAMK2A            15           50           38          704           15
#> TRAM1L1          335          128          306          267          648
#>         TCGA.A2.A04X TCGA.E2.A15P TCGA.AO.A0JM TCGA.E2.A14R TCGA.A2.A0D2
#> OR5H15             0            0            0            0            0
#> SPANXN2            0            0            0            0            0
#> TAF7            7940         9123         9378         4787         4544
#> TMEM107          852         1406          294          658          501
#> CAMK2A            19           47           66           15           39
#> TRAM1L1          330         1081          394           15          279

6 Reading input data using PathAnalyser

There are two types of input required for pathway activity assessment using PathAnalyser:

  1. gene expression data matrices
  2. gene expression signatures

6.1 Gene expression data

The read_expression_data function reads gene expression data from
a file with an delimiter, the function checks the delimiter of the file and reads the file accordingly. It returns an output with gene symbols/IDs as the row names and columns representing each sample IDs in the dataset.

data_se <- read_expression_data("/Users/taniyapal/Documents/Group Project/TCGA_unannotated.txt")
data_se<-data_se[,1:200]
print(data_se[1:5, 1:5])

6.2 Gene signature data

Gene signature files can be read by the read_signature_data function. It returns a dataframe with signature IDs/symbols in the first column and their expression pattern is represented as up r regulated or down regulated by +1 and -1 respectively.

sig_df <- read_signature_data("/Users/taniyapal/Documents/Group Project/ESR1_UP.v1._UP.csv", "/Users/taniyapal/Documents/Group Project/ESR1_DN.v1_DN.csv")
head(sig_df)

6.3 Quality control and pre-processing of data

6.3.1 Log Counts Per Million (CPM) transformation of RNASeq expression matrices

The log_cpm_transformation function transforms the raw data by the method of log CPM and returns the transformed data. It also performs sanity check of the transformation by returning box plots for visualization of distribution of data before and after transformation.

# using toy data set as expression matrix
data("ER_data_se1")
data_se <- ER_data_se1 
normalized_se <- log_cpm_transformation(data_se)

6.3.2 Checking and filtering genes from the gene expression matrix

The gene names are first checked for inclusion in the gene signature data frame. Genes that are included in the gene signature are retained then subjected to further filtration, in which their expression across the data set measured. If a gene is not present in at least 10% of the total number of samples in the data set, the gene is dropped from the expression matrix. A console message is printed reporting the number of features (genes) retained in the final normalized expression matrix.

normalized_se <- check_signature_vs_dataset(normalized_se, ER_sig_df)
#> Number of feature present in expression dataset: 146

7 Classifying samples by pathway activity using gene signatures

7.1 Overview of pathway activity classification

PathAnalyser provides assessment of pathway activity for each sample in a gene expression data set, by extending an existing unsupervised and non-parametric sample-wise gene set enrichment method, Gene set variation analysis (GSVA). GSVA measures the variation of gene set enrichment across a sample population in a manner independent of class labels, and summarizes the sample-wise expression of the gene set in the form of gene set enrichment scores for each sample (Hänzelmann et al. 2013). However, unlike GSVA, which does not distinguish up-regulated gene sets from down-regulated gene sets, PathAnalyser considers the expression of the up-regulated and down-regulated gene sets of a given pathway’s gene signature both separately and explicitly, using the GSVA method solely for estimating the relative abundance of expression of the up- and down-regulated gene sets for a given pathway in each sample.

7.2 Overall PathAnalyser classification algorithm

The estimated expression abundance for both these gene sets is used to check expression consistency with the gene signature and infer pathway activity for a given sample in the transcriptomic data set.

The classification algorithm implemented in PathAnalyser uses four thresholds for the classification algorithm (two for each gene set of the pathway signature):

  1. High expression abundance for up-regulated gene set
  2. Low expression abundance for up-regulated gene set
  3. High expression abundance for down-regulated gene set
  4. Low expression abundance for down-regulated gene set

The diagram below summarises the classification algorithm employed by PathAnalyser to classify samples by pathway activity. We include an external image with the R function:

Overiew of algorithm diagram

Figure 4: Overiew of algorithm diagram

Samples are ordered by expression abundance of up-regulated and down-regulated genes of the gene signature separately using the GSVA scores. Those samples which have the most abundant expression of the up-regulated gene set (exceeding threshold 1) show evidence of activation of the pathway but are only classified as having the pathway “Active”, if the same samples show a low abundance of expression of the down-regulated genes (below threshold 4) of the gene signature i.e. demonstrating consistency between evidence from up- and down-regulated gene sets of the gene signature. Similarly, samples with evidence of pathway inactivation from the up-regulated gene set of the gene signature i.e. the least abundant expression of the up-regulated genes of the gene signature (below threshold 2) and which also show evidence of pathway inactivation from the down-regulated gene set of the signature i.e.  most abundance expression of the down-regulated gene set of the gene signature (exceeding threshold 3), demonstrate evidence of consistency of pathway inactivation from both gene sets and therefore are classified as “Inactive”. All other samples are classified as “Uncertain” since there is insufficient evidence for pathway activation or inactivation.

In PathAnalyser, the GSVA scores thresholds for classifying a sample as “Active” (consistent with the gene signature) and “Inactive” (inconsistent with the gene signature) can be set manually and depend on the user’s pathway activity classification requirements.

PathAnalyser provides two functions for classifying samples using an absolute GSVA scores thresholds classify_GSVA_abs or percentile thresholds which is effectively rank-based classify_GSVA_percent using percentiles (default at 25%) , as well as a visualization function to view the distributions of the GSVA scores for both gene sets gsva_scores_dist, which will be discussed below.

7.3 Visualising GSVA Score Distribution

PathAnalyser provides gsva_scores_dist method to visualize the GSVA score distribution for the abundance of expression of the up-regulated and down-regulated gene sets of the gene signature for each sample. The resulting density plot usually follows a bimodal distribution of GSVA scores for each sample for the up-regulated and down-regulated gene sets, a consequence of the intrinsic binaric nature of gene expression in datasets, where samples are often either have a relatively high expression abundance of a gene set, or relatively low expression abundance of a gene set resulting in two “sub-populations” of samples.

Figure 5 shows two peaks for each gene set when running gsva_scores_dist with logCPM-normalized ER_data_se1 gene expression matrix. For both gene sets, the peak with the highest GSVA scores represents samples with high abundance of expression of the gene set and the lower score peak represents samples with low abundance of expression of the gene set.

gsva_scores_dist(ER_sig_df, normalized_se)
Density plots showing the distribution of GSVA scores for samples in logCPM-normalised `ER_data_se1` data set for the up-regulated (left) and down-regulated gene set (right) of the pathway signature using `gsva_scores_dist` function.

Figure 5: Density plots showing the distribution of GSVA scores for samples in logCPM-normalised ER_data_se1 data set for the up-regulated (left) and down-regulated gene set (right) of the pathway signature using gsva_scores_dist function

7.3.1 Determining thresolds for classification

As the two peaks represent represent low expression abundance and high abundance expression, the distribution of GSVA scores that constitute the “valley” i.e. the area between the two peaks (fig. 6), represent samples that exhibit relatively weaker evidence of differential expression abundance relative to other samples and the algorithm would classify these samples as having “Uncertain” pathway activity. Because the peaks represent the mode GSVA scores (one for high and low expression abundance), thresholds should be selected between these two peak to enable the algorithm to classify samples around these modes as being consistent or inconsistent with the pathway signature.

As mentioned above, there are four thresholds that can be tuned by the user: the high and low expression abundance for the up-regulated and down-regulated gene sets. For example, a user could set the high expression abundance thresholds for the up-regulated gene set to -0.2 for the low expression threshold for both gene sets, and 0.2 for the high expression thresholds for both gene sets. Samples with GSVA scores between the these thresholds for both the up-regulated and down -regulated gene-set would be considered by the classification algorithm as having “Uncertain” pathway activity.

plot <- gsva_scores_dist(ER_sig_df, normalized_se)
# Add thresholds on plot
library(ggplot2)
data_threshs <- data.frame(Geneset=c("Up", "Down"), vline=c(-0.2, 0.2))
plot + geom_vline(xintercept=data_threshs$vline, linetype=2)
GSVA scores distributions of the samples for the down-regulated (left) and up-regulated gene set (right) of the pathway gene signature, showing absolute thresholds of -0.2 and 0.2 for characterising low and high expression abundance of both gene sets respectively. The expression data set used is the logCPM-normalised `ER_data_se1` gene expression data set.

Figure 6: GSVA scores distributions of the samples for the down-regulated (left) and up-regulated gene set (right) of the pathway gene signature, showing absolute thresholds of -0.2 and 0.2 for characterising low and high expression abundance of both gene sets respectively
The expression data set used is the logCPM-normalised ER_data_se1 gene expression data set.

Shrinking the distance between the high and low expression thresholds would result in fewer “Uncertain” pathway activity classifications (fig. 7), because more samples would meet the consistency/ inconsistency expression thresholds for the pathway signature (high expression thresholds would be lower, low expression thresholds would be higher for each gene set).

# more relaxed thresholds, fewer uncertain labels
data_threshs <- data.frame(Geneset=c("Up", "Down"), 
                           vline=c(-0.1, 0.1))
plot + geom_vline(xintercept=data_threshs$vline, linetype=2)
GSVA scores distributions of samples with more relaxed thresholds for assessing expression consistency of the down-regulated gene set (left) and the up-regulated gene set (right) with the pathway gene expression signature. Low and high expression abundance thresholds for the down-regulated gene set (left) are set to -0.1 and 0.1 respectively, while low and high expression abundance thresholds for the up-regulated gene set (right) are also set to -0.1 and 0.1 respectively. Data is generated from running GSVA on logCPM-normalised `ER_data_se1` expression dataset with ER signature (`ER_sig_df`).

Figure 7: GSVA scores distributions of samples with more relaxed thresholds for assessing expression consistency of the down-regulated gene set (left) and the up-regulated gene set (right) with the pathway gene expression signature
Low and high expression abundance thresholds for the down-regulated gene set (left) are set to -0.1 and 0.1 respectively, while low and high expression abundance thresholds for the up-regulated gene set (right) are also set to -0.1 and 0.1 respectively. Data is generated from running GSVA on logCPM-normalised ER_data_se1 expression dataset with ER signature (ER_sig_df).

Conversely, expanding the distance between the thresholds would increase the frequency of “Uncertain” pathway activity classifications, as the number of samples meeting the thresholds for consistency and inconsistency would be reduced.

# more stringent thresholds, greater uncertain labels
data_threshs <- data.frame(Geneset=c("Up", "Down"), vline=c(-0.4, 0.4))
plot + geom_vline(xintercept=data_threshs$vline, linetype=2)
Density plots for GSVA scores distributions for samples with relatively more stringent thresholds for assessing consistency of the up-regulated gene set (left) and down-rgulated gene set (right) with the pathway signature. The low and high expression abundance thresholds are -0.4 and 0.4 for both up-regulated and down-regulated gene sets respectively. Data is generated from running GSVA on logCPM-normalised `ER_data_se1` expression dataset with ER signature (`ER_sig_df`).

Figure 8: Density plots for GSVA scores distributions for samples with relatively more stringent thresholds for assessing consistency of the up-regulated gene set (left) and down-rgulated gene set (right) with the pathway signature
The low and high expression abundance thresholds are -0.4 and 0.4 for both up-regulated and down-regulated gene sets respectively. Data is generated from running GSVA on logCPM-normalised ER_data_se1 expression dataset with ER signature (ER_sig_df).

7.4 Classification using absolute GSVA score thresholds

As the distribution of GSVA scores tends to be biomodal rather than Guassian, the user may prefer to use absolute GSVA thresholds for checking consistency of expression abundance of each gene set with pathway signature for each sample. PathAnalyser provides the classify_GSVA_abs function for classifying samples by pathway activity using absolute GSVA score thresholds which can be tuned by the user:

  • up_thresh.high: for high expression abundance for the up-regulated gene-set
  • up_thresh.low: for low expression abundance for the up-regulated gene-set of the gene signature
  • dn_thresh.high: for high expression abundance for the down-regulated gene-set of the gene signature
  • dn_thresh.low: for low expression abundance for the down-regulated gene-set of the gene signature

Note that absolute thresholds are required from the user when running classify_GSVA_abs and can be positive or negative numbers. A summary of the classification is printed to the console highlighting the number of samples in each pathway activity class (“Active”, “Inactive” and “Uncertain”) and the number of samples in total.

classes_df <- classify_GSVA_abs(ER_sig_df, normalized_se, 
                                    up_thresh.low=-0.2, up_thresh.high=0.2, 
                                    dn_thresh.low=-0.2, dn_thresh.high=0.2)
#> Summary of sample classification based on pathway activity:
#> --------------------------------------------------------------
#> Number of samples in each pathway activity class:
#> 
#>    Active  Inactive Uncertain 
#>        30        28         2 
#> 
#> Total number of samples: 60
#> Total number of samples classified: 58

7.5 Classification using percentile GSVA score thresholds

The package also provides a GSVA-based classification method classify_GSVA_percent, which uses percentile ranks as thresholds rather than absolute GSVA thresholds. The default percentile used by the function is 25%, so the high expression abundance thresholds in both gene sets would be 75%, by default and low expression abundance thresholds in both gene sets would be 25%, default. A custom percentile threshold can be provided using the optional thresh_percent parameter. As for the absolute threshold function, a summary of the classification is printed to the console highlighting the number of samples in each pathway activity class (“Active”, “Inactive” and “Uncertain”) and the number of samples in total. Increasing the percentile threshold for classification has the equivalent effect of reducing the distance between the high expression abundance and low expression abundance thresholds for both gene sets, therefore reducing the frequency of “Uncertain” classifications.

# default percentile = 25% (quartile)
classes_df.percent <- classify_GSVA_percent(ER_sig_df, normalized_se)
#> Summary of sample classification based on pathway activity:
#> --------------------------------------------------------------
#> Number of samples in each pathway activity class:
#> 
#>    Active  Inactive Uncertain 
#>        15        15        30 
#> 
#> Total number of samples: 60
#> Total number of samples classified: 30
# custom percentile = 30%
classes_df.percent <- classify_GSVA_percent(ER_sig_df, normalized_se, 
                                            percent_thresh=30)
#> Summary of sample classification based on pathway activity:
#> --------------------------------------------------------------
#> Number of samples in each pathway activity class:
#> 
#>    Active  Inactive Uncertain 
#>        18        18        24 
#> 
#> Total number of samples: 60
#> Total number of samples classified: 36

8 Visualising pathway activity

An interactive PCA plot providing a visual summary of the pathway-based classification can be produced by using the classes_pca function with the normalized data set and predicted classes data frame. Each point represent a sample in the expression data set and are colored according to the predicted activity class (“Active”, “Inactive”, “Uncertain”). Hovering over the sample points displays the sample name, along with predicted pathway activity class and also the relative coordinates of the sample in the first two principal components.

8.1 Generating PCA plot

classes_pca(normalized_se, classes_df, pathway_name = "ER")

9 Evaluating pathway activity classification

This package provides a method calculate_accuracy for evaluating pathway activity classification, which gets the actual labels and the labels classified by the classification methods as parameters; then creates confusion matrix based on them. The first parameter true_labels_source could be file-name, matrix or data frame and contains sample names and their corresponding actual labels.The second parameter classes_df could be matrix or data frame and contains classified labels found by the classification methods classify_GSVA_percent or classify_GSVA_abs. Third parameter pathway specifies with which pathway the labels(actual and classified) are associated. Other parameters display_statistics and display_roc_curve are boolean flags used as optional features to display statistics and ROC Curve respectively. Default values for the flags is FALSE.

true_labels_df <- read.table("Sample_labels.txt", sep="\t", header=T)
confusion_matrix <- calculate_accuracy(true_labels_df, classes_df, 
                                       pathway = "ER", display_statistics=TRUE, 
                                       display_roc_curve=TRUE)
#> [1] "Confusion Matrix for  ER  pathway"
#> [1] "--------------------------------------------------------------"
#>                      Actual Positive Actual Negative Actual Uncertain
#> Prediction Positive               27               3                0
#> Prediction Negative                2              26                0
#> Prediction Uncertain               1               1                0
#> [1] "--------------------------------------------------------------"
#> [1] "Statistics in Confusion Matrix "
#> [1] "--------------------------------------------------------------"
#> [1] "Proportion of classified samples: 96.67"
#> [1] "Accuracy amongst classified samples: 91.38"
#> [1] "True Positive(TP): 27"
#> [1] "True Negative(TN): 26"
#> [1] "False Negative(FN): 2"
#> [1] "False Positive(FP): 3"
#> [1] "--------------------------------------------------------------"
#> [1] "True Positive Rate(TPR)(sensitivity)(Recall): 93.10"
#> [1] "True Negative Rate(TNR)(specificity): 89.66"
#> [1] "Precision (Positive predictive value): 90.00"
#> [1] "False Positive Rate(FPR): 10.34"
#> [1] "False Negative Rate(FNR): 6.90"
#> [1] "--------------------------------------------------------------"
#>  Actual Positive Actual Negative Actual Uncertain
#>  Min.   : 1.0    Min.   : 1.0    Min.   :0       
#>  1st Qu.: 1.5    1st Qu.: 2.0    1st Qu.:0       
#>  Median : 2.0    Median : 3.0    Median :0       
#>  Mean   :10.0    Mean   :10.0    Mean   :0       
#>  3rd Qu.:14.5    3rd Qu.:14.5    3rd Qu.:0       
#>  Max.   :27.0    Max.   :26.0    Max.   :0

10 Example: Assessing ER pathway activity in a breast cancer RNAseq dataset

To demonstrate the standard workflow of assessing pathway activity in a transcriptomic data set using PathAnalyser, we will use PathAnalyser to classify samples in an RNA seq data set according to ER pathway activity as (“Active”, “Inactive” or “Uncertain”) for the ER pathway. The ER pathway status of tumors is of particular interest since, activation of the pathway is correlated sensitivity to endocrine therapy and disease prognosis more broadly. The data set collected from The Genome Atlas Project (TCGA) contains raw read counts for over 20,000 genes (almost the complete human transcriptome) for 20 primary breast cancer samples. Ten samples were shown have the ER pathway active (ER+), while the remaining ten samples were reported to be inactive for the ER pathway (ER-).

10.1 Read RNASeq gene expression dataset and ER pathway signature data

First, the gene expression matrix data set text file which contains raw read counts for the 20 breast cancer samples and ER gene signature text files (up-regulated and down-regulated gene set files) are read into the expression matrix and gene signature data frame data types respectively. The gene expression matrix contains raw read counts for 20,124 genes and 20 samples. Additionally, the ER signature compiled from the Sensitivity to Endocrine Therapy (SET) index proposed by Symanns et al. (2011) contains 160 genes of which 59 are down-regulated and 101 are up-regulated.

# Load transcriptomic data set (gene expression matrix of samples)
data_se <- read_expression_data("toy_data.txt")
dim(data_se)
#> [1] 20124    20
head(data_se)
#>         TCGA.AR.A0TY TCGA.AO.A124 TCGA.BH.A0BG TCGA.AN.A0G0 TCGA.AO.A0J8
#> OR5H15             0            0            0            0            0
#> SPANXN2            0            0            0            0            0
#> TAF7           16841         6974         8211         5586        12210
#> TMEM107          189          851          581          433         1298
#> CAMK2A            10           20           18           27           19
#> TRAM1L1          233          198          426          328         1044
#>         TCGA.D8.A1XZ TCGA.AO.A0JE TCGA.E9.A5FL TCGA.LL.A8F5 TCGA.C8.A12V
#> OR5H15             0            0            0            0            0
#> SPANXN2            0            0            0            0            0
#> TAF7            8627         9138         3219         7663         6867
#> TMEM107          629          253          226         1265          385
#> CAMK2A            30           47           72           45           18
#> TRAM1L1          385          467          153          342           33
#>         TCGA.A2.A0CM TCGA.AC.A3TN TCGA.AC.A3QQ TCGA.A2.A0YF TCGA.E2.A15F
#> OR5H15             0            0            0            0            0
#> SPANXN2            0            0            0            0            0
#> TAF7            4416         2869          718        14649         7106
#> TMEM107          332          660          230         1738          319
#> CAMK2A            29           34           25            4           48
#> TRAM1L1          212          482           28          428          507
#>         TCGA.AO.A1KQ TCGA.BH.A0E0 TCGA.BH.A18G TCGA.5L.AAT0 TCGA.A8.A07C
#> OR5H15             0            0            0            0            0
#> SPANXN2            0            0            0            0            0
#> TAF7            6351         3150         5278         4839         9226
#> TMEM107          633          234          245          284          565
#> CAMK2A            32           34           23           43            9
#> TRAM1L1          169           71          220          272          615
# read signature data from the two individual gene set files for up-regulated
# and down-regulated gene sets
ER_sig <- read_signature_data("ESR1_UP.v1._UP.grp", "ESR1_DN.v1_DN.grp")
dim(ER_sig)
#> [1] 160   2

10.2 QC and pre-processing of the expression dataset and ER signature data

As the gene expression matrix file contained raw RNASeq counts, the data must be normalized prior to passing the matrix to any of the classification functions. Normalization of read counts is necessary to account for differences in sequencing depths among the libraries. The PathAnalyser log_cpm_transformation function can be used to perform a log counts per million (CPM) transformation on the gene expression matrix. Furthermore, log_cpm graphically confirms the logCPM transformation has been performed correctly since the box plots displayed by the function, representing the logCPM values for each sample, are more aligned following the logCPM transformation compared to before the transformation.

# Transforming matrix with log cpm transformation and sanity check of the transformation
normalized_se <- log_cpm_transformation(data_se)

To ensure that the gene names in the gene signatures and gene expression data set are consistent, we use the check_signature_vs_dataset gene names to filter out genes from the gene expression matrix that are not in signature and those genes which are not expressed in at least 10% of samples. The output of running the function on the normalised dataset and ER signature shows that 147 genes were retained in the normalised dataset with 13 genes being dropped either due to expression in less than 10% of samples, or due to being absent in the ER signature.

normalized_se <- check_signature_vs_dataset(normalized_se, ER_sig)
#> Number of feature present in expression dataset: 146

dim(normalized_se)
#> [1] 146  20

10.3 Classifying breast cancer samples based on ER pathway activity.

Once the expression matrix is normalised to logCPM and genes with insufficient counts or those that do not feature in the ER signature are removed, the expression data set is ready for classification alongside the ER signature. To acquire an idea of the distribution of GSVA scores, the gene expression matrix can be passed as an argument along with the ER signature to the gsva_scores_dist. A bimodal density plot is often produced for the up-regulated and down-regulated gene set of the ER signature, reflective of the often binaric nature of gene expression in samples of a dataset discussed above. The peak centered around the GSVA score of -0.8 for the down-regulated gene set and the peak around 0.8 for the down-regulated gene set represent the mode of samples with the low expression abundance of the down-regulated genes of the ER signature and high abundance of the up-regulated genes of the ER signature respectively. Similarly, for the up-regulated genes of the ER pathway signature, the peak situated at around a GSVA score of -0.7 and and around the GSVA score of 0.9 represent samples of low expression abundance and high expression abundance of the up-regulated genes of the ER pathway signature respectively.

gsva_scores_dist(ER_sig, normalized_se)

Samples in a gene expression matrix can be classified according to evidence of expression consistency with the up-regulated and down-regulated gene sets of the ER signature using GSVA by using absolute thresholds for high and low expression for each gene set of the ER signature via classify_GSVA_abs or percentile threshold for high and low expression for each gene set of the ER signature via classify_GSVA_percent. Here, we use the default percentile thresholds of 25% to classify samples as having high or low expression abundance with the up-regulated and down-regulated gene sets of the ER gene signature. The R console output of running the classify_GSVA_percent on our data set and ER signature, shows that of the 20 samples in our data set, 4 samples were classified as active and 5 samples were classified as inactive for the ER pathway. Eleven samples were classified as uncertain. A data frame is produced with names of the 20 samples as the first column and their corresponding predicted ER pathway activity classes as the second column.

classes_df <- classify_GSVA_percent(ER_sig, normalized_se)
#> Summary of sample classification based on pathway activity:
#> --------------------------------------------------------------
#> Number of samples in each pathway activity class:
#> 
#>    Active  Inactive Uncertain 
#>         4         5        11 
#> 
#> Total number of samples: 20
#> Total number of samples classified: 9
head(classes_df)
#>         sample     class
#> 1 TCGA.AC.A3TN    Active
#> 2 TCGA.AO.A0J8    Active
#> 3 TCGA.A2.A0YF    Active
#> 4 TCGA.5L.AAT0    Active
#> 5 TCGA.E2.A15F Uncertain
#> 6 TCGA.AO.A1KQ Uncertain

Depending on the user’s classification requirements the thresholds can be adjusted. For example, if the user would like to reduce the number of “Uncertain” classification samples, the user could provide 50% percentile threshold as an argument e.g.:

classes_df_50p <- classify_GSVA_percent(ER_sig, normalized_se, 
                                        percent_thresh=50)
#> Summary of sample classification based on pathway activity:
#> --------------------------------------------------------------
#> Number of samples in each pathway activity class:
#> 
#>    Active  Inactive Uncertain 
#>        10        10         0 
#> 
#> Total number of samples: 20
#> Total number of samples classified: 20

From the output of running the classification function with 50% percentile, it is evident that no samples were classified as “Uncertain”, with 10 samples classified as “Active” and 10 samples classified as “Inactive” by the classification algorithm.

Alternatively, the breast cancer samples in the data set could be classified by ER pathway activity, using absolute GSVA score thresholds for the up-regulated and down-regulated gene sets of the ER signature respectively.

classes_df_abs <- classify_GSVA_abs(ER_sig, normalized_se, up_thresh.low = -0.2, 
                                    up_thresh.high = 0.2, dn_thresh.low = -0.2, 
                                    dn_thresh.high = 0.2)
#> Summary of sample classification based on pathway activity:
#> --------------------------------------------------------------
#> Number of samples in each pathway activity class:
#> 
#>    Active  Inactive Uncertain 
#>         9        10         1 
#> 
#> Total number of samples: 20
#> Total number of samples classified: 19

10.4 Visualizing ER pathway activity classification of the breast cancer samples

To acquire a visual summary of the ER pathway-based classification of our 20 breast cancer samples, we can run the classes_pca function to produce an interactive PCA plot of the samples colored by the predicted ER pathway activity status (active, inactive, uncertain).

classes_pca(normalized_se, classes_df, pathway_name = "ER")

The PCA plot shows the samples predicted as active and inactive form relatively tight clusters in the first principal component (PC), as would be expected since the inter-variation within the active and inactive classes would be smaller than between the different classes. Interestingly, the active and inactive clusters are in opposite quadrants in the first PC indicating the samples have been separated according to pathway activity well in the first PC. Furthermore, samples classified as uncertain are more scattered in the first PC, with a few samples present in the active / inactive clusters. Overall the PCA plot shows the samples have clustered well according to the predicted pathway activity classes. Sample names and their corresponding predicted pathway activity class can be displayed for a given sample on the plot by hovering over a sample point using a mouse pointer.

10.5 Assessing the accuracy of the pathway-based classification

A more detailed assessment of the ER pathway based classification of the breast cancer samples can be performed by providing the true pathway activity class labels for each sample, along with predicted pathway activity classes classes_df to the calculate_accuracy function. These “true labels” must only contain (“Active”, “Inactive or”Uncertain”) for the calculate_accuracy function to correctly assess the pathway-based classification. The pathway parameter is set to “ER” since ER pathway activity is being assessed. The function outputs a confusion matrix reporting in tabular form, the number of true positives (TP), true negative(TN), false positives (FP), false negatives (FN) and also a breakdown of the number of uncertain classifications and their actual pathway activity status. Further classification evaluation using the display_statistics reveals an accuracy of 100% of among the 9 classified samples, the 4 samples classified as ER positive were actually ER positive and similarly, the 5 samples classified as ER negative were indeed ER negative.

# read true pathway activity class labels for data set samples
true_labels_df <- read.table("Sample_labels.txt", 
                             header=TRUE, sep = "\t")
# assess accuracy and generate confusion matrix for classification
confusion_matrix <- calculate_accuracy(true_labels_df, classes_df, 
                                       pathway = "ER")
#> [1] "Confusion Matrix for  ER  pathway"
#> [1] "--------------------------------------------------------------"
#>                      Actual Positive Actual Negative Actual Uncertain
#> Prediction Positive                4               0                0
#> Prediction Negative                0               5                0
#> Prediction Uncertain               6               5                0
#> [1] "--------------------------------------------------------------"
# detail breakdown of classification evaluation (accuracy, % classified etc)
confusion_matrix <- calculate_accuracy(true_labels_df, classes_df, 
                                       pathway = "ER", display_statistics = T)
#> [1] "Confusion Matrix for  ER  pathway"
#> [1] "--------------------------------------------------------------"
#>                      Actual Positive Actual Negative Actual Uncertain
#> Prediction Positive                4               0                0
#> Prediction Negative                0               5                0
#> Prediction Uncertain               6               5                0
#> [1] "--------------------------------------------------------------"
#> [1] "Statistics in Confusion Matrix "
#> [1] "--------------------------------------------------------------"
#> [1] "Proportion of classified samples: 45.00"
#> [1] "Accuracy amongst classified samples: 100.00"
#> [1] "True Positive(TP): 4"
#> [1] "True Negative(TN): 5"
#> [1] "False Negative(FN): 0"
#> [1] "False Positive(FP): 0"
#> [1] "--------------------------------------------------------------"
#> [1] "True Positive Rate(TPR)(sensitivity)(Recall): 100.00"
#> [1] "True Negative Rate(TNR)(specificity): 100.00"
#> [1] "Precision (Positive predictive value): 100.00"
#> [1] "False Positive Rate(FPR): 0.00"
#> [1] "False Negative Rate(FNR): 0.00"
#> [1] "--------------------------------------------------------------"
#>  Actual Positive Actual Negative Actual Uncertain
#>  Min.   :0.000   Min.   :0.000   Min.   :0       
#>  1st Qu.:2.000   1st Qu.:2.500   1st Qu.:0       
#>  Median :4.000   Median :5.000   Median :0       
#>  Mean   :3.333   Mean   :3.333   Mean   :0       
#>  3rd Qu.:5.000   3rd Qu.:5.000   3rd Qu.:0       
#>  Max.   :6.000   Max.   :5.000   Max.   :0

In addition to a detail summary of the classification assessment, we can also view a receiver operating curve (ROC) curve for our classification, which visualizes the sensitivity and specificity of the data set classification by fitting a logistic regression model to the binary ER pathway activity classifications (active, inactive):

# detail breakdown of classification evaluation (accuracy, % classified etc)
confusion_matrix <- calculate_accuracy(true_labels_df, classes_df, 
                                       pathway = "ER", display_statistics = T,
                                       display_roc_curve = T)
#> [1] "Confusion Matrix for  ER  pathway"
#> [1] "--------------------------------------------------------------"
#>                      Actual Positive Actual Negative Actual Uncertain
#> Prediction Positive                4               0                0
#> Prediction Negative                0               5                0
#> Prediction Uncertain               6               5                0
#> [1] "--------------------------------------------------------------"
#> [1] "Statistics in Confusion Matrix "
#> [1] "--------------------------------------------------------------"
#> [1] "Proportion of classified samples: 45.00"
#> [1] "Accuracy amongst classified samples: 100.00"
#> [1] "True Positive(TP): 4"
#> [1] "True Negative(TN): 5"
#> [1] "False Negative(FN): 0"
#> [1] "False Positive(FP): 0"
#> [1] "--------------------------------------------------------------"
#> [1] "True Positive Rate(TPR)(sensitivity)(Recall): 100.00"
#> [1] "True Negative Rate(TNR)(specificity): 100.00"
#> [1] "Precision (Positive predictive value): 100.00"
#> [1] "False Positive Rate(FPR): 0.00"
#> [1] "False Negative Rate(FNR): 0.00"
#> [1] "--------------------------------------------------------------"
#>  Actual Positive Actual Negative Actual Uncertain
#>  Min.   :0.000   Min.   :0.000   Min.   :0       
#>  1st Qu.:2.000   1st Qu.:2.500   1st Qu.:0       
#>  Median :4.000   Median :5.000   Median :0       
#>  Mean   :3.333   Mean   :3.333   Mean   :0       
#>  3rd Qu.:5.000   3rd Qu.:5.000   3rd Qu.:0       
#>  Max.   :6.000   Max.   :5.000   Max.   :0

In concordance with the statistical classification evaluation metrics, the ROC curve also shows that the samples were classified well according to ER pathway activity by the curve reaching the top left corner of the plot (indicating a high true positive rate and a low false positive rate). The nine points represent the nine samples classified as “active” / “inactive” for the ER pathway.

11 Session Info

The output of sessionInfo upon which this file was generated:

sessionInfo()
#> R version 4.1.3 (2022-03-10)
#> Platform: x86_64-pc-linux-gnu (64-bit)
#> Running under: Ubuntu 20.04.4 LTS
#> 
#> Matrix products: default
#> BLAS:   /usr/lib/x86_64-linux-gnu/blas/libblas.so.3.9.0
#> LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.9.0
#> 
#> locale:
#>  [1] LC_CTYPE=en_GB.UTF-8       LC_NUMERIC=C              
#>  [3] LC_TIME=en_GB.UTF-8        LC_COLLATE=en_GB.UTF-8    
#>  [5] LC_MONETARY=en_GB.UTF-8    LC_MESSAGES=en_GB.UTF-8   
#>  [7] LC_PAPER=en_GB.UTF-8       LC_NAME=C                 
#>  [9] LC_ADDRESS=C               LC_TELEPHONE=C            
#> [11] LC_MEASUREMENT=en_GB.UTF-8 LC_IDENTIFICATION=C       
#> 
#> attached base packages:
#> [1] stats     graphics  grDevices utils     datasets  methods   base     
#> 
#> other attached packages:
#> [1] ggplot2_3.3.5           PathAnalyser_0.0.0.9000 GSVA_1.42.0            
#> [4] BiocStyle_2.22.0       
#> 
#> loaded via a namespace (and not attached):
#>   [1] plyr_1.8.7                  lazyeval_0.2.2             
#>   [3] GSEABase_1.56.0             crosstalk_1.2.0            
#>   [5] BiocParallel_1.28.3         usethis_2.1.5              
#>   [7] GenomeInfoDb_1.30.1         digest_0.6.29              
#>   [9] htmltools_0.5.2             fansi_1.0.3                
#>  [11] magrittr_2.0.3              memoise_2.0.1              
#>  [13] ScaledMatrix_1.2.0          NCmisc_1.1.6               
#>  [15] limma_3.50.1                remotes_2.4.2              
#>  [17] Biostrings_2.62.0           annotate_1.72.0            
#>  [19] matrixStats_0.61.0          prettyunits_1.1.1          
#>  [21] colorspace_2.0-3            blob_1.2.2                 
#>  [23] xfun_0.30                   dplyr_1.0.8                
#>  [25] callr_3.7.0                 crayon_1.5.1               
#>  [27] RCurl_1.98-1.6              jsonlite_1.8.0             
#>  [29] graph_1.72.0                glue_1.6.2                 
#>  [31] gtable_0.3.0                zlibbioc_1.40.0            
#>  [33] XVector_0.34.0              DelayedArray_0.20.0        
#>  [35] pkgbuild_1.3.1              BiocSingular_1.10.0        
#>  [37] Rhdf5lib_1.16.0             SingleCellExperiment_1.16.0
#>  [39] BiocGenerics_0.40.0         HDF5Array_1.22.1           
#>  [41] scales_1.1.1                futile.options_1.0.1       
#>  [43] DBI_1.1.2                   edgeR_3.36.0               
#>  [45] Rcpp_1.0.8.3                viridisLite_0.4.0          
#>  [47] xtable_1.8-4                bit_4.0.4                  
#>  [49] rsvd_1.0.5                  stats4_4.1.3               
#>  [51] htmlwidgets_1.5.4           httr_1.4.2                 
#>  [53] RColorBrewer_1.1-3          ellipsis_0.3.2             
#>  [55] pkgconfig_2.0.3             XML_3.99-0.9               
#>  [57] farver_2.1.0                sass_0.4.1                 
#>  [59] locfit_1.5-9.5              utf8_1.2.2                 
#>  [61] tidyselect_1.1.2            labeling_0.4.2             
#>  [63] rlang_1.0.2                 reshape2_1.4.4             
#>  [65] AnnotationDbi_1.56.2        munsell_0.5.0              
#>  [67] tools_4.1.3                 cachem_1.0.6               
#>  [69] cli_3.2.0                   generics_0.1.2             
#>  [71] RSQLite_2.2.12              devtools_2.4.3             
#>  [73] evaluate_0.15               stringr_1.4.0              
#>  [75] fastmap_1.1.0               yaml_2.3.5                 
#>  [77] processx_3.5.3              knitr_1.38                 
#>  [79] bit64_4.0.5                 fs_1.5.2                   
#>  [81] purrr_0.3.4                 KEGGREST_1.34.0            
#>  [83] sparseMatrixStats_1.6.0     formatR_1.12               
#>  [85] proftools_0.99-3            brio_1.1.3                 
#>  [87] compiler_4.1.3              rstudioapi_0.13            
#>  [89] plotly_4.10.0               png_0.1-7                  
#>  [91] testthat_3.1.3              tibble_3.1.6               
#>  [93] bslib_0.3.1                 stringi_1.7.6              
#>  [95] highr_0.9                   ps_1.6.0                   
#>  [97] futile.logger_1.4.3         desc_1.4.1                 
#>  [99] lattice_0.20-45             Matrix_1.4-1               
#> [101] vctrs_0.4.0                 pillar_1.7.0               
#> [103] lifecycle_1.0.1             rhdf5filters_1.6.0         
#> [105] BiocManager_1.30.16         jquerylib_0.1.4            
#> [107] data.table_1.14.2           bitops_1.0-7               
#> [109] irlba_2.3.5                 GenomicRanges_1.46.1       
#> [111] R6_2.5.1                    bookdown_0.25              
#> [113] IRanges_2.28.0              sessioninfo_1.2.2          
#> [115] reader_1.0.6                lambda.r_1.2.4             
#> [117] assertthat_0.2.1            pkgload_1.2.4              
#> [119] rhdf5_2.38.1                SummarizedExperiment_1.24.0
#> [121] rprojroot_2.0.3             withr_2.5.0                
#> [123] S4Vectors_0.32.4            GenomeInfoDbData_1.2.7     
#> [125] parallel_4.1.3              VennDiagram_1.7.1          
#> [127] grid_4.1.3                  beachmat_2.10.0            
#> [129] tidyr_1.2.0                 rmarkdown_2.13             
#> [131] DelayedMatrixStats_1.16.0   MatrixGenerics_1.6.0       
#> [133] pROC_1.18.0                 Biobase_2.54.0

12 References

Hänzelmann, S., Castelo, R. & Guinney, J. GSVA: gene set variation analysis for microarray and RNA-Seq data. BMC Bioinformatics 14, 7 (2013). https://doi.org/10.1186/1471-2105-14-7

Smid, M., Wang, Y., Zhang, Y., Sieuwerts, A.M., Yu, J., Klijn, J.G., Foekens, J.A. and Martens, J.W., 2008. Subtypes of breast cancer show preferential site of relapse. Cancer research, 68(9), pp.3108-3114.

Symmans, W.F., Hatzis, C., Sotiriou, C., Andre, F., Peintinger, F., Regitnig, P., Daxenbichler, G., Desmedt, C., Domont, J., Marth, C. and Delaloge, S., 2010. Genomic index of sensitivity to endocrine therapy for breast cancer. Journal of clinical oncology, 28(27), p.4111. https://dx.doi.org/10.1200%2FJCO.2010.28.4273